Abstract

Reliability is an important concern in the development of software for modern
systems. Some researchers have hypothesized that particular fault-handling
approaches or techniques are so effective that other approaches or techniques
are superfluous. The authors have performed a study that compares two major
approaches to the improvement of software, software fault elimination and
software fault tolerance, by examination of the fault detection obtained by five
techniques: run-time assertions, multi-version voting, functional testing
augmented by structural testing, code reading by stepwise abstraction, and
static data-flow analysis. This study has focused on characterizing the sets of
faults detected by the techniques and on characterizing the relationships
between these sets of faults. The results of the study show that none of the
techniques studied is necessarily redundant to any combination of the others.
Further results reveal strengths and weaknesses in the fault detection by the
techniques studied and suggest directions for future research.

Key words: Assertions, Back-To-Back Testing, Code Reading, Evaluation of
software methodologies, Fault elimination, Fault tolerance, N-version
programming, Software experiments, Static analysis, Testing

1      Introduction 

Reliability is an important concern in the development of software for modern
systems. Software reliability improvement techniques may be classified according
to the approach they use to deal with faults (source-code defects): 1)
fault-avoidance techniques attempt to prevent the introduction of faults into
software during development; 2) fault-elimination techniques attempt to locate
and remove faults from software prior to its use; 3) fault-tolerance techniques
attempt to prevent faults from causing a program to fail.

Some  researchers  have  hypothesized  that  particular  fault-handling  approaches 
or  techniques  are  so  effective  that  other  approaches  or  techniques  are  superfluous. 
It  seems  important  to  investigate  these  hypotheses  and,  in  general,  to  study  the 
relationship  between  techniques  that  use  different  fault-handling  approaches.  Such 
studies  provide  direction  for  improving  the  techniques  and  for  selecting  a  set  of 
techniques  for  use  on  a  particular  project. 

Research  is  continuing  into  ways  to  evaluate  and  improve  various  software  fault- 
handling  techniques.  However,  the  empirical  studies  examining  these  techniques 
have largely compared them solely on the number of faults detected, rather than
examining the relationship between the sets of faults detected by the techniques.
An  experiment  that  examines  the  intersection  of  the  sets  of  faults  detected  by 
various  techniques  may  reveal  new  limitations  of  these  techniques,  suggest  research 
directions  to  extend  their  utility,  and  provide  comparative  information. 

The authors have performed such a study, contrasting the fault detection
capability of a variety of software fault-tolerance and fault-elimination
techniques including n-version programming, run-time assertions, functional
testing augmented by structural testing, code reading by stepwise abstraction
and static data-flow analysis. These techniques were selected for both practical
and theoretical reasons. From a theoretical point of view, each of these
techniques follows a different approach to the detection of faults, so a broad
examination of fault detection is achieved. From a practical point of view, each
of these techniques is relatively easy to use, and each technique is within the
capability of undergraduate student participants. Each of these techniques has
also been examined in previous empirical studies, so comparison of the results
of this study with the results of other researchers is possible.

One of the techniques that has been studied empirically is n-version
programming. These studies have often revealed limitations in this technique.
Despite discouraging empirical results, n-version programming is gaining
acceptance among software developers: n-version programming has been used in
software to control aircraft[30] and railroads[18] and has been proposed for
nuclear power plants[17]. One drawback to the multiversion technique is that the
total development costs are increased due to the costs of developing multiple
versions. Since in most situations there is a fixed amount of resources that can
be invested in building the software, there must be savings somewhere else in
order to allow multiple-version development.


In order to make the technique affordable, it has been suggested that n-version
programming will be so effective that it can be used as a partial substitute for
current software verification and validation procedures[3]. Instead of extensive
V&V of a single program, Avizienis and Kelly[4] suggest that independent
versions can be executed in an operational environment and V&V completed while
using the software. Furthermore, by employing hobbyist programmers to write
multiple software versions for critical systems at home, they suggest it will be
possible to relax the need for rigorous quality control and centralized
programming environments. Although it is difficult to believe that anyone would
take these latter suggestions seriously, the need to reduce costs when
developing multiple versions has led to at least one attempt to reduce testing
of safety-critical software for commercial aircraft based on the argument that
the use of n-version programming provides such high reliability that unit
testing can be reduced. But the hypothesis that testing can be reduced in
multiversion software needs to be carefully examined, both empirically and
theoretically, before modifying current software development practices. No such
evaluation has yet been performed.

The goal of our study is to compare some fault-elimination and fault-tolerance
techniques in terms of faults detected in order to provide data on the issues
and questions raised by previous experiments. After a survey of the relevant
research, this paper describes the experimental procedure used in our study,
characterizes the results and summarizes the impact of those results in
providing direction for future research.


2     Related  Work 

2.1      Fault-Elimination  Experiments 

There are a large number of studies examining software testing. Much of the
recent work has focused on assessing the effectiveness of various testing
techniques. Myers[27] did a comparative study of functional testing against code
reading for fault detection in small FORTRAN programs. The code-reading
methodology used was an informal desk check conducted by 3 participants. Myers
found a wide variation between individuals, but no significant difference
between the performance of the two techniques.

A  study  by  Hetzel[19]  compared  code  reading,  structural  testing  and  functional 
testing in terms of the faults detected by each technique. In that study, 39
experienced subjects tested three PL/I programs ranging in length from 64 to 170
statements.  The  programs  were  a  sequential  series  of  nested  data  transformations. 
The  code-reading  technique  used  in  the  study,  developed  by  Hetzel,  exploited  this 
structure  to  summarize  the  effect  of  the  programs  by  summarizing  the  effect  of  each 
transformation.  The  structural-testing  criterion  used  was  statement  coverage.  The 
study  found  that  functional  testing  discovered  the  most  faults  and  code  reading  the 
least,  with  structural  testing  falling  in  between.  Code  reading  detected  faults  for 
which  test  cases  are  hard  to  derive,  and  it  detected  initialization  faults. 

Basili and Selby[6] compared code reading by stepwise abstraction with
functional and structural testing in four small programs (145 to 365 lines long)
written in an Algol-like language. Three of the programs contained
naturally-occurring faults, while the fourth contained a mixture of
naturally-occurring and seeded faults. The structural-testing criterion used was
statement coverage. Functional testing used equivalence partitioning and
boundary-value analysis.

In  the  Basili  and  Selby  study,  code  reading  by  stepwise  abstraction  detected  more 
faults  than  either  of  the  other  techniques  studied.  Structural  testing  detected  fewer 
faults  than  functional  testing.  The  study  also  compared  the  types  of  faults  detected 
by  each  method  using  two  classification  schemes:  omission  vs.  commission  and 
the  type  of  operation  in  which  the  fault  was  present  (initialization,  control,  data, 
computation,  interface  or  output).  Code  reading  and  functional  testing  detected 
insignificantly  different  numbers  of  each  class  of  faults  except  interface  faults,  where 
code  reading  detected  significantly  more,  and  control  faults,  where  functional  testing 
detected significantly more. In each case, structural testing detected either
significantly fewer faults of each type or an insignificantly different number. The
study  did  not  provide  information  on  the  size  of  the  intersections  of  the  sets  of 
faults  detected  by  each  method  nor  on  the  sets  of  faults  detected  by  combinations 
of  methods. 

Girgis and Woodward[16] compared the fault-detection abilities of four types of
testing: weak mutation testing, data-flow testing, control-flow testing and
static data-flow analysis. The comparison used a set of small FORTRAN programs
(textbook examples) that were seeded with faults one at a time by an automated
tool, then tested until either the seeded fault was detected or the testing
criteria were satisfied. The results indicate a large variation in the
effectiveness of the testing criteria. Analysis of the experimental data shows
an insignificant difference between the four groups, due to the large variation
between the criteria in each group. The differences between the individual
criteria were significant. The most effective criterion was All-LCSAJs (Linear
Code Sequence and Jump). However, this study failed to indicate if this
effectiveness result was due to the choice of faults seeded in the programs or
to characteristics of the detection techniques (e.g., the seeding strategy may
have favored All-LCSAJs). Moreover, the results may have been influenced by the
particular testing strategies of each type used.

Ramamoorthy  and  Ho[29]  studied  two  forms  of  static  data-flow  analysis  on  large 
FORTRAN  programs.  Their  results  confirmed  the  limitations  of  static  data-flow 
analysis, but faults were detected during their experiment. In the 2,000 line
program analyzed in that study, simple static data-flow analysis (analysis for
unreachable code, interface inconsistencies and locally-uninitialized variables)
detected four
faults.  In  a  separate  23,000  line  program,  a  more  comprehensive  static  data-flow 
analysis  (performing  more  thorough  uninitialized-variable  checking,  loop-increment 
checking  and  analysis  for  branch  anomalies)  detected  20  faults. 

There  have  been  proposals  to  use  multi-version  voting  in  the  testing  process 
[7,  8,  28,  31].  In  this  method,  known  as  back-to-back  testing,  the  vote  itself  is  used 
as  the  test  oracle,  and  therefore  a  larger  number  of  tests  can  be  executed.  The 
underlying  assumptions  here  are  that  (1)  given  that  a  fault  leads  to  an  erroneous 
output  it  will  be  detected  by  the  voting  process,  and  (2)  the  faults  that  would 
have  been  detected  by  other  testing  techniques,  such  as  structural  testing  or  static 
analysis  techniques,  will  be  elicited  and  detected  by  voting  on  random  or  functional 
test  cases  alone. 
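
The idea of using the vote as the oracle can be sketched as follows. This is a minimal illustration only, not the procedure of any cited study: the three `abs`-like versions, the seeded fault in `v2`, and the function name `back_to_back` are all hypothetical.

```python
from collections import Counter

def back_to_back(versions, test_inputs):
    """Run every version on every input; report inputs where outputs disagree.

    Returns a list of (input, suspects) pairs, where suspects are the indices
    of the versions that differ from the most common output.
    """
    discrepancies = []
    for x in test_inputs:
        outputs = [version(x) for version in versions]
        tally = Counter(outputs)
        if len(tally) > 1:  # any disagreement counts as a detection
            majority_output, _ = tally.most_common(1)[0]
            suspects = [i for i, out in enumerate(outputs) if out != majority_output]
            discrepancies.append((x, suspects))
    return discrepancies

# Hypothetical versions of an absolute-value routine.
def v0(x):
    return abs(x)

def v1(x):
    return x if x >= 0 else -x

def v2(x):
    # Seeded fault: wrong boundary, so -2 and -1 are returned unchanged.
    return x if x > -3 else -x

print(back_to_back([v0, v1, v2], range(-5, 6)))
```

Note that the sketch can only flag inputs on which the versions differ. If every version fails identically on an input, nothing is reported, which is exactly the case assumption (1) rules out.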

A study by Bishop et al.[7] examined back-to-back testing by varying the
specification language, development practices and implementation language used
for the versions. Three professionally-developed versions were used and seven
faults were detected by back-to-back testing, after an initial acceptance test.
Of the seven
faults,  two  were  common  between  two  of  the  three  versions  used.  No  independent 
method  of  verification  of  the  results  was  used  so  three-way  failures  could  not  be 
detected.  However,  these  results  are  difficult  to  interpret  due  to  two  factors.  The 
experiment  used  a  small  number  of  versions,  which  causes  the  results  to  be  strongly 
affected  by  characteristics  of  the  individual  versions,  with  no  capability  of  identifying 
what  characteristics  are  due  to  the  different  techniques  that  produced  the  versions. 
The  experiment  also  lacked  isolation  between  independent  variables.  Two  of  the 
versions  were  coded  in  the  same  implementation  language  (FORTRAN  77).  One  of 
these  two  and  the  third  version  used  a  common  specification  (jointly  prepared),  but 
all  three  used  different  design  methods.  This  design  cannot  distinguish  between  the 
effects  of  varying  design  methods  and  the  effects  of  different  language-specification 
mixtures. 

Given  the  contradictory  results  of  previous  studies,  it  appears  that  no  simple 
rules  exist  for  choosing  among  these  testing  techniques.  Furthermore,  while  relative 
comparisons  of  number  of  errors  detected  provide  some  basis  for  choosing  between 
mutually  exclusive  alternatives,  this  is  not  necessarily  the  situation  with  respect  to 
testing. Although limited resources and time usually force limitations in the
total amount of testing performed, one would probably want to apply more than
one method for detecting software faults. It would be helpful to have
information on the degree to which two techniques are complementary, i.e.,
detect different errors, or redundant, i.e., detect the same errors, along with
more detailed information about the particular errors (and hopefully classes of
errors) detected and not detected. Some of this information can be derived by
theoretical analysis while some will require empirical study since human
behavior and capabilities are involved for which
few  adequate  models  exist. 
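
The distinction between complementary and redundant techniques can be made concrete by treating each technique's detections as a set of fault identifiers. The sketch below assumes such identifiers are available; the technique names and fault numbers are invented for illustration and are not data from any study discussed here.

```python
from itertools import combinations

def pairwise_overlap(detected):
    """For each pair of techniques, split their faults into shared and exclusive sets."""
    report = {}
    for a, b in combinations(sorted(detected), 2):
        report[(a, b)] = {
            "both": detected[a] & detected[b],         # redundant detections
            "only_first": detected[a] - detected[b],   # complementary detections
            "only_second": detected[b] - detected[a],
        }
    return report

def redundant_techniques(detected):
    """A technique is redundant if the others together find every fault it finds."""
    redundant = []
    for name, faults in detected.items():
        others = set().union(*(f for n, f in detected.items() if n != name))
        if faults <= others:
            redundant.append(name)
    return redundant

# Hypothetical fault identifiers, for illustration only.
detected = {
    "code_reading": {1, 2, 3, 5},
    "functional_testing": {2, 3, 4},
    "voting": {3, 6},
}

overlap = pairwise_overlap(detected)
print(overlap[("code_reading", "functional_testing")])
print(redundant_techniques(detected))
```

With these made-up sets, every technique finds at least one fault that no other technique finds, so none is redundant to the combination of the others.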

2.2      Fault-Tolerance  Experiments 

There  have  been  very  few  experiments  that  have  explored  the  use  of  assertions 
for  fault  tolerance.  A  study  by  Anderson [2]  applied  recovery  blocks,  which  use 
assertions  to  test  the  system  state,  to  a  real-time  control  system.  The  code  (a 
professionally-implemented  version  of  a  submarine-control  program)  was  8000  lines 
long  and  organized  into  14  concurrent  activities.  The  results  showed  that  while 
assertions  were  quite  difficult  to  formulate,  some  reliability  improvement  was  gained 
through  the  use  of  recovery  blocks. 

Using software from a voting experiment[22], Leveson, Cha, Knight and Shimeall
[24] had a set of 24 students instrument eight versions of a Pascal program
(varying in length from 400 to 800 lines). These versions were all known to
produce correct results in excess of 99.5% of the time. The investigators had
the participants plan their augmentation of the source code working strictly
from the specification and then proceed to augment the code with the planned
assertions, plus any other assertions the participants desired. The
investigators tested the 24 instrumented programs using random data as well as
data on which the original versions were known to fail. The assertions detected
only 14 of the 60 known faults in the 24 instrumented versions. However,
assertions detected 6 additional faults (not previously known). Examining the
cases where known faults were not detected revealed three causes for ineffective
checks: bad check placement, bad check condition, and use of faulty code from
the original version in the check. These results show that assertions are
effective in detecting faults even in software of relatively high quality. In
this study, the assertions found as many faults as voting (but largely different
faults). Lastly, the results showed that specification-based checks alone were
not as effective as using them together with code-based checks.

There have been several experiments involving the use of n-version programming.
The  first,  by  Chen[12],  provided  little  information  because  of  difficulties  in  executing 
the  experiment.  However,  it  was  noted  that  10%  of  the  test  cases  caused  failures 
for  the  3-version  systems  (35  failures  in  384  test  cases).  Chen  reported  that  there 
were  several  types  of  design  faults  that  were  not  well  tolerated  in  this  experiment, 
in  particular  missing-case  logic  faults. 

Avizienis and Kelly[4] examined the use of multiple specification languages in
developing multi-version software. The reported data indicate that in over 20%
of 100 test cases executed, the three-version systems were unable to agree or
voted on a wrong answer. In addition, 11 of the 18 programs aborted on invalid
input. Despite these results, they conclude that "By combining software versions
that have not been subjected to V&V testing to produce highly reliable
multiversion software, we may be able to decrease cost while increasing
reliability." The data in this experiment do not support the hypothesis implicit
in this statement that high reliability will be achieved by using this technique
on software that has not been rigorously tested. They continue to say that "Most
errors in the software versions will be detected by the decision algorithm
during on-line production use of the system." There were 816 combinations of the
programs in this experiment, each run on 100 test cases, for a total of 81,600
calculated results. In 5.6% of the cases where an error occurred in at least one
version, the error was not detected by the voting procedure, i.e., the programs
agreed on a wrong answer.
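
The figures quoted for that experiment are mutually consistent if the 816 combinations are taken to be all three-version systems that can be formed from the 18 programs. That reading is an inference from the numbers rather than something stated above; the arithmetic can be checked as follows.

```python
from math import comb

programs = 18            # programs reported in the experiment
versions_per_system = 3  # assumed: each combination is a three-version system
test_cases = 100         # test cases run on each combination

systems = comb(programs, versions_per_system)  # number of distinct triples
results = systems * test_cases                 # one voted result per triple per test case

print(systems, results)
```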

Another  experiment,  by  Knight  and  Leveson[23],  found  that  with  27  programs 
run  on  1,000,000  test  cases,  an  error  was  not  detected  by  voting  three  versions  in 
35%  of  the  cases  where  an  error  actually  occurred.  The  individual  programs  in 
this  experiment  had  a  much  higher  average  reliability  than  in  the  Kelly  experiment 
(i.e.,  0.9993  versus  0.72)  indicating  that  they  were  more  thoroughly  tested  before 
being  subjected  to  the  voting  procedure.  The  results  provide  some  justification  for 
hypothesizing  that  faults  leading  to  correlated  failures  are  more  likely  to  escape 
detection  during  testing  than  faults  that  do  not  lead  to  correlated  failures. 

Knight  and  Leveson[22]  investigated  the  problems  of  common  failures  between 
independently-produced  versions  and  have  also  looked  at  reliability  improvements 
[23].  Although  the  failure  probability  was  decreased  (about  10  times)  using  three- 
version  voting  compared  to  single  versions  in  the  latter  study,  this  comparison  may 
not be a realistic one. It is reasonable to expect that applying some
reliability-enhancing technique would produce an improvement over not applying
any special
techniques.  A  more  realistic  comparison  is  to  examine  the  reliability  of  multiple 
versions  voted  together  versus  the  reliability  of  a  single  version  with  additional 
reliability-enhancement  techniques  applied. 

Although it was not the original goal, there is a study that provides one data
point in this comparison. Brunelle and Eckhardt[11] took a portion of the SIFT
operating system, which was written from a formally-verified design at SRI[26],
and ran it in a three-way voting scheme with two new (non-formally verified)
versions. The results showed that although no faults were found in the original
SRI version, there were instances where the two unverified versions outvoted the
correct, verified version to produce a wrong answer². Care must be taken in
using this data because the qualifications of the implementors of the three
versions may be different.

A study by Avizienis, Liu and Schütz[5] tried to examine the use of multiple
implementation languages in n-version programming. Six versions of a
flight-control program were implemented separately (one in each of six
languages) and executed in triplets to look for faults. It is difficult to use
the results of this experiment to understand n-version programming, since a
detailed design specification (including the algorithms to be used and the form
of the internal state) was included as part of the problem specification. Since
there was minimal design diversity, there was unlikely to be detection of design
errors, only coding errors or typographical errors. For example, two of the
versions were found to be erroneous because a hand-written value of 65,536 was
misread as 65.536. Furthermore, the versions were evaluated to reach agreement
among themselves without any use of a separate standard of correctness. After a
carefully-controlled development involving reviews of the versions for
discrepancies, the investigators found few functional differences between the
versions (which is not surprising given that the programmers were all given the
same design specification).

Examining the results obtained by the previous experiments reveals several
characteristics of n-version programming. First, the prevalence of coincident
failures (observed in every experiment conducted thus far) reduces the
effectiveness of multi-version voting in dealing with faults. Second, there
appears to be substantial difficulty in getting versions to agree on a consensus
result. Research has found that even mathematically correct algorithms sometimes
produce differing results due to numeric instability[9]. Agreement concerns were
cited in the decision by Avizienis, Liu and Schütz[5] to specify the algorithm
to be used in their versions. However, this solution essentially sacrifices the
idea of version or design diversity, by limiting it to the syntactic level,
which may have only minor effect on software reliability.

² These results are not reported in the published paper on the experiment, but
were obtained through personal communication with one of the authors.

Finally,  analysis  of  the  experiments  shows  that  multi-version  voting  is  a  fault- 
specific  technique.  Decisions  on  the  portion  of  the  system  to  be  written  in  a  multi- 
version  manner  and  on  what  algorithms  are  to  be  used  in  the  different  versions  limit 
the  class  of  faults  that  these  techniques  are  able  to  handle.  The  known  problems  in 
handling  missing-case  logic  faults  indicate  this  specificity.  What  distinguishes  this 
technique from other fault-specific techniques is the lack of an explicit
characterization of what faults are handled.

Much of the existing research on n-version programming has employed uncontrolled
experimental designs. Claims have been made of improvements in reliability
due  to  these  techniques  in  comparison  with  unverified  software.  This  comparison  is 
unrealistic.  The  alternative  to  using  multi-version  voting  is  not  to  leave  the  software 
unverified,  but  to  apply  other  verification  and  validation  techniques. 

3      Experimental  Design 

A set of programs written from a single specification for a combat simulation
problem are used in the study described in this paper. The specification is
derived from an industrial specification obtained from TRW[13]. The simulation
is structured as three sets of transformations from the input data to the output
data. The first set of transformations converts the input data to an abstract
intermediate state. The intermediate state is updated by a second set of
transformations in each cycle of simulated time. After a number of cycles
(specified in the input data), the output data are produced by the third set of
transformations from the final intermediate state. Prototype implementations
were developed by three individuals in order to evaluate and improve the quality
and comprehensibility of the requirements specification before the development
of the versions began.

The experiment participants used throughout were upper-division computer science
students. One set of participants, students in a senior-level class on advanced
software engineering methods, performed all design and implementation activities
on the program versions. A disjoint set of participants attempted to detect
faults in the programs. All decisions on whether or not to report a section of
code as a fault were made by a student participant or a piece of software
written by a student participant. Once the reports were generated (by all
techniques), the administrator acted as final arbiter as to which reports
identified faults and which were false alarms; this decision was not reported to
the students during their participation in the experiment. All participants were
trained in the techniques used in the experiment; however, none had applied
these specific techniques on any projects prior to this experiment with the
exception of previous Pascal programming experience by the implementation
participants.

The  development  activity  involved  26  individuals,  working  in  two-person  teams. 
Teams  were  assigned  randomly.  Of  the  13  teams,  8  eventually  produced  versions 
that  were  judged  acceptable  for  use  in  the  experiment.  The  development  activity 
involved  preparing  architectural  and  detailed  designs  for  the  software,  coding  the 
software  from  those  designs,  and  debugging  the  software  sufficiently  to  pass  the 
version acceptance test. The version acceptance test was a set of 15 data sets. The
data  sets  were  designed  to  execute  each  of  the  major  portions  of  the  code  at  least 
once.  The  acceptance  test  was  not,  and  was  not  intended  to  be,  a  basis  for  quality 
assessment  of  the  code,  but  rather  was  a  test  of  whether  all  major  portions  of  the 
code  were  present  in  some  operable  form.  The  goal  of  the  development  procedure 
was  to  have  the  versions  in  a  state  similar  to  that  of  normal  software  development 
immediately  prior  to  unit  testing. 

Table  1  describes  the  finished  versions.  The  column  marked  'Modules'  shows  the 
number  of  Pascal  procedures  and  functions  in  each  version.  The  size  of  the  source 
is  given  in  two  figures,  source  lines  and  code  lines,  with  the  latter  figure  omitting 
blank  and  comment  lines.  Finally,  the  complexity  of  each  module  in  each  version 
was  measured  using  McCabe's  Cyclomatic  Complexity,  V(G).  These  figures  are 
profiled  in  the  last  three  columns  of  the  table.  The  minimum  V(G)  is  omitted  since 
in  each  version  there  was  at  least  one  module  that  was  strictly  linear  in  structure 
(no  loops  or  branches),  and  thus  has  a  V(G)  of  1.  The  mean  code  length  is  1777 
lines,  with  a  standard  deviation  of  435. 
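
The summary figures for code length can be reproduced from the code-line counts listed for the eight versions in Table 1. The short check below assumes the reported standard deviation is the sample standard deviation (divisor n - 1), which is the reading that matches the stated value.

```python
from statistics import mean, stdev

# Code lines (blank and comment lines omitted) for versions 1 through 8, from Table 1.
code_lines = [2414, 1540, 1201, 2003, 1544, 2206, 1978, 1331]

average = mean(code_lines)   # arithmetic mean
spread = stdev(code_lines)   # sample standard deviation (n - 1 divisor)

print(round(average), round(spread))
```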

The experimental activity involved applying five different fault detection
techniques to the program versions: code reading by stepwise abstraction[25],
static data-flow analysis, run-time assertions inserted by the development
participants, multi-version voting, and functional testing with follow-on
structural testing. The code reading was performed by eight individuals. Prior
to code reading, all developer-inserted comments were stripped from the version
source code. Each version was read by one person, and each person read only one
version. The data-flow analysis was performed by implementing and executing an
analysis tool based on algorithms by Fosdick and Osterweil[14].

                                                        Module V(G)
Version   Modules   Source Lines   Code Lines     Mean   Std. Dev.   Max
   1         72         7503          2414       4.671      6.844     47
   2         56         3452          1540       3.055      4.125     13
   3         41         1480          1201       3.698      5.549     25
   4         57         3663          2003       4.825      5.282     21
   5         28         1834          1544       7.694     13.011     58
   6         72         3065          2206       2.868      3.423     16
   7         75         2734          1978       3.427      5.084     28
   8         57         1896          1331       2.193      2.503     10

Table 1: Version Source Profile

The  development  participants  were  trained  in  writing  run-time  assertions  and 
were  required  to  include  assertions  in  their  versions.  The  run-time  assertions  were 
present  during  the  application  of  all  techniques.  If  an  assertion  condition  fails,  a 
message  is  generated. 

A  "gold"  version  has  been  written  by  the  experiment  administrator  as  an  aid  for 
fault  diagnosis,  but  this  actually  just  provides  another  version  to  check  against.  In 
fact,  faults  in  the  gold  version  have  been  detected.  The  gold  version  is  not  included 
in  the  experimental  data.  It  is,  of  course,  possible  that  failures  common  to  all 
of  the  versions,  including  the  gold,  will  not  be  detected.  This  is  an  unavoidable 
consequence  of  this  type  of  experiment. 

To examine fault tolerance by voting, we chose three versions because more than
this number of versions would, in most cases, be totally unrealistic, and two-version
voting does not provide fault tolerance. When at least one version in a two-version
system provides an erroneous result, there is no chance of masking (or tolerating)
that failure and either no answer or a wrong answer is provided. All versions were
executed  on  a  set  of  10,000  randomly-generated  test  cases.  The  test  data  generator 
has  been  designed  to  provide  realistic  test  cases  according  to  an  expected  usage 
profile  in  the  operational  environment. 

Functional  testing  augmented  by  structural  testing  was  performed  on  the  pro- 
grams. A  series  of  97  functional  test-data  sets  were  generated  from  the  specification 
by  trained  undergraduates.  These  data  sets  were  planned  using  the  abstract  func- 
tion technique  described  by  Howden[20].  Part  of  each  plan  was  a  description  of 
the  program  instrumentation  needed  to  view  the  output  of  each  abstract  function. 
The  structural  coverage  of  the  functional  data-sets  was  measured  using  the  ASSET 
structural  testing  tool[15],  and  sufficient  additional  data  sets  were  defined  to  bring 
the  coverage  up  to  the  all-predicate-uses  level.  The  participants  used  a  total  of  60 
additional  data  sets  to  achieve  all-predicate-uses  coverage  in  all  procedures  in  all 
versions.  The  number  of  data  sets  executed  by  each  individual  version  varied  from 
5  to  13,  as  needed  to  achieve  the  required  coverage. 

Because  some  of  the  techniques  applied  to  the  programs  are  open-ended  in  terms 
of  possible  application  of  resources,  it  was  necessary  to  attempt  to  hold  relatively 
constant  the  resources  allocated  to  each  technique.  This  was  not  necessary  for 
those  techniques,  namely  static  data  flow  analysis  and  code  reading,  that  have  a 
fixed  and  relatively  low  cost.  Table  2  contains  the  amount  of  human  hours  and 
computer  hours  devoted  to  each  technique.  The  time  devoted  to  software  testing 
and  voting  is  approximately  two  calendar  months  per  version  for  both. 

                         Computer Hours                    Human Hours
Technique          Mean     SD    Min.   Max.      Mean    SD    Min.    Max.
Code Reading          0      0      0      0         36    15     19      60
Static Analysis      40     30    0.5    104          1   0.1   0.75    1.25
Software Test        84     63     36    219        373     4    366     378
Voting             1415   1055    600   3692          6    14      8   [one value missing in source]

Table 2: Hours Devoted to Each Technique Per Version

In general, the student participants took their efforts seriously and performed
careful work in this experiment. In the code reading, the average rate of analysis
and annotation was 70 lines per hour. Analysis of the annotations shows that the
students  were  reading  in  depth  and  not  simply  skimming  the  source  listings.  In  the 
functional  test  planning,  similar  attention  to  detail  was  displayed  by  the  students. 
It  is  possible  for  test  plans  to  be  of  variable  quality,  so  three  outside  industrial 
experts  evaluated  the  plans  in  an  attempt  to  gauge  the  plan  quality  before  testing 
the  versions.  In  their  opinion  the  plans,  while  obviously  not  the  work  of  experienced 
professionals,  were  comparable  to  those  used  in  many  industrial  settings. 

4      Results 

Two  general  categories  of  questions  have  guided  our  analysis  of  the  data.  The  first 
is  a  comparison  between  fault  elimination  and  fault-tolerance  techniques,  i.e.,  are 
they  substitutes  for  each  other  or  do  they  complement  each  other.  The  second 
category  of  questions  involves  comparing  various  testing  techniques  with  respect  to 
fault  detection,  including  consideration  of  their  relative  strengths  and  weaknesses 
and how these techniques might be improved.

4.1      Fault Tolerance and Fault Elimination

Before presenting the data, it is necessary to define some terms. In a three-version
voting system, there are three possible results: a correct answer is produced (there
are at least 2 programs in the triplet that produce correct answers), no answer is
produced (three different results are produced), or a wrong answer is agreed upon
(at least two programs produce identically wrong answers). If a correct answer
is produced despite the failure of one of the programs, then the triplet is fault
tolerant and the single error is masked. If no agreement is reached, then one could
say that the individual program failures were detected, but fault tolerance (run-time
masking) has not been achieved. In the third case, i.e., producing an incorrect
result, the failure is not detected and the faults have not been tolerated.
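
The three outcomes defined above can be illustrated with a minimal sketch. It
assumes that version outputs can be compared by exact equality and that a known
correct output is available for classification; the function names and the
oracle argument are hypothetical and are not part of the experimental software.

```python
from collections import Counter


def classify_vote(outputs, correct):
    """Classify one three-version vote.

    outputs: the three version outputs for a single input (hashable values).
    correct: the expected output for that input (a hypothetical oracle).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        return "no answer"        # three different results are produced
    if value == correct:
        return "correct answer"   # at least two versions are correct
    return "wrong answer"         # at least two identical wrong results


def is_masked(outputs, correct):
    """True when a correct answer is produced despite exactly one failed version."""
    failures = sum(1 for o in outputs if o != correct)
    return classify_vote(outputs, correct) == "correct answer" and failures == 1
```

Under these assumptions, `is_masked` is true only for the fault-tolerant case: a
unanimous correct vote involves no failure and therefore masks nothing.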

To  examine  whether  fault-tolerance  techniques  are  substitutes  for  fault-elimin- 
ation techniques,  we  shall  consider  the  faults  tolerated  by  3-version  voting  and 
potentially  tolerated  by  techniques  using  assertions.  Note  that  the  assertions  them- 
selves provide  no  fault  recovery  ability,  but  may  be  used  in  conjunction  with  either 
forward  or  backward  recovery  strategies  to  tolerate  faults  once  they  are  detected. 
No  recovery  strategies  were  implemented  in  the  programs  used  in  this  experiment. 
This  means  that  the  results  relating  to  assertions  should  be  viewed  as  counts  of 
faults  potentially  tolerated  (if  the  hypothetical  recovery  techniques  were  effective) 
rather  than  faults  actually  tolerated.  This  contrasts  with  fault  tolerance  by  voting, 
where  the  same  mechanism  (a  vote)  is  used  to  detect  and  tolerate  faults.  The  results 
in  this  section  relating  to  voting  are  counts  of  faults  actually  tolerated  by  3-version 
voting  in  this  experiment. 

The first question that was investigated is whether run-time voting tolerated the
faults detected by the fault-elimination techniques used, i.e., functional testing
augmented by some structural testing, code reading, and static analysis. This question
tests the previously-quoted hypothesis by Avizienis and Kelly[4] that multiversion
voting may reduce or replace traditional software V&V.

If  voting  tolerates  the  faults  detected  by  testing,  then  elimination  or  reduction  in 
testing  can  possibly  be  justified,  and  testing  could  be  completed  while  the  software  is 
being  used.  However,  if  the  faults  detected  by  fault  elimination  are  not  tolerated  by 
voting  at  run-time,  then  testing  cannot  be  eliminated.  Furthermore,  any  argument 
for  reduction  of  testing  would  need  to  prove  that  the  reduction  in  testing  merely 
results  in  the  non-detection  (and  non-elimination)  of  the  faults  that  voting  will 
reliably  tolerate  and  does  not  result  in  the  software  containing  faults  causing  run- 
time failures  that  might  have  been  detected  and  eliminated  by  increased  testing. 

There  are  two  aspects  to  answering  this  question.  The  first  is  whether  the  same 
faults  are  detected  by  the  fault-elimination  techniques  and  tolerated  by  voting.  If 
not,  then  testing  cannot  be  eliminated  and  reduction  of  the  amount  of  each  type 
of  testing  cannot  be  justified.  However,  testing  can  also  be  reduced  by  eliminating 
just  one  type  of  testing.  Therefore,  it  is  also  necessary  to  investigate  whether  one 
particular  type  of  testing  is  superfluous  when  using  voting.  The  first  aspect  is 
covered  in  the  rest  of  this  section  of  the  paper;  the  second  is  discussed  in  the  next 
section  where  the  faults  detected  by  each  technique  are  compared. 

The programs were executed on 10,000 randomly-generated data sets. In general,
we found that the faults that were tolerated were not the same as the faults that were
detected by fault-elimination techniques. In addition, of the 56 total voting triplets,
none tolerated more than a relatively small fraction of the faults detected by the
fault-elimination techniques. The best triplet tolerated only 41 faults, compared to
the 107 detected by all techniques in the versions (numbers 4, 5 and 6) that participated
in that triplet. All triplets were unreliable in tolerating even those faults they could
tolerate (i.e., those faults they tolerated at least once).

Table  3  shows  the  number  and  intersection  of  faults  found  by  each  class  of  tech- 
nique, i.e.,  voting,  assertions  and  the  three  fault-elimination  techniques.  Assertions 
are  included  in  the  results  that  follow  solely  to  better  characterize  the  relationship 
between  the  fault  tolerance  and  fault-elimination  approaches.  Note  the  relatively 
small  number  of  faults  that  were  both  tolerated  by  voting  and  detected  by  a  fault- 
elimination  technique  (27,  given  by  the  sum  of  the  last  two  lines  of  table  3).  It  is 
also  interesting  to  note  that  assertions  and  voting  detected  only  a  few  of  the  same 
faults.  This  supports  the  results  by  Leveson,  Cha,  Knight  and  Shimeall[24]. 

A second question involves the relationship between coincident failures and testing.
There has been speculation about whether the faults that result in coincident
failures (and thus reduce the fault-tolerance capability of voting systems) are likely
to be detected through testing procedures. Examination of the specific faults that
were detected by testing indicates that testing detected only 24 of the 103
common-failure faults. Furthermore, the common-failure faults found by testing did not
include those faults that produced the majority of the common failures during the
executions. This result occurred despite the fact that some of the testing techniques
target the testing of special cases, which are often involved in common failures. This
result may indicate that the faults that reduce the effectiveness of n-version
programming are among the most difficult to detect. This is not surprising and is
consistent with the intuitive explanation that the parts of the problem that lead to
mistakes by the programmers may be equally difficult for the testers to handle. That
is, humans are unlikely to make mistakes in a random fashion.


                                          Version
                                1    2    3    4    5    6    7    8   Total
Tolerated by vote               7    7   11    8   14    7    5    8     67
Detected by assertions only     4    3    1    8    2    2    4    4     28
Detected by fault elim. only    4   17   27   13   15    4   13   26    119
Both assert & fault elim.       7    1    1    3    6    3    0    0     21
Both assert & vote              0    0    2    0    1    0    2    3      8
Both vote & fault elim.         0    2    3    3    2    4    5    2     21
Assert, vote & fault elim.      0    0    1    1    2    1    0    1      6

Table 3: Number of Faults Tolerated or Detected


Finally,  it  is  interesting  to  ask  if  there  were  any  faults  tolerated  by  run-time 
voting  that  were  not  detected  by  the  fault-elimination  techniques.  If  so,  then  the 
use  of  fault  elimination  does  not  preclude  the  use  of  fault  tolerance,  i.e.,  they  are 
complementary  techniques  rather  than  competitive  techniques.  Again,  table  3  shows 
that  this  did  occur  for  67  faults,  although,  as  discussed  in  the  rest  of  this  section, 
the  average  individual  triplet  tolerated  33  faults  (of  the  104  faults  present  in  the 
average  individual  triplet)  and  did  not  do  this  consistently.  Firm  conclusions  cannot 
be  drawn  from  this  data  given  the  novice  nature  of  the  participants  in  the  fault- 
elimination  efforts,  but  it  does  raise  interesting  questions  for  further  study. 

Voting actually did somewhat worse in comparison with testing than is indicated
by the data in table 3. In the table, run-time voting is credited with tolerating a
fault if it tolerates at least one failure caused by that fault even though it may not
tolerate every failure caused by the fault. It is also credited with tolerating a fault
if only one or several of the 56 combinations of versions tolerate it, even though not
all of them do. The situation is different for fault elimination since the detection
of a fault by a fault-elimination technique leads to elimination of the fault prior to
production use of the software and elimination of all failures related to that fault. In
general, we found that even when the failure caused by a fault is at times tolerated
by a triplet, it is usually not tolerated every time, and there is wide variation among
the different triplets in terms of how effective they were in tolerating faults.

In order to show the variation, we computed the number of faults tolerated at
least once by a triplet divided by the total number of faults that caused a failure
in one of the versions comprising that triplet. This fraction ranged from 60.4% to
88.6% with an average of 75.9%³ and a standard deviation of 6.2%. That is, even
the best triplet missed 11% of the faults that it should have been able to tolerate.

Another  way  of  looking  at  variability  among  triplets  is  to  consider  the  conditional 
probability  that  a  triplet  will  mask  a  failure  given  that  a  failure  occurs  (i.e.,  the 
conditional  probability  that  a  correct  result  is  produced  despite  the  failure  of  one 
of  the  versions).  This  fraction  was  even  lower,  i.e.,  it  ranged  from  20.8%  to  61.5% 
with  a  mean  of  37.9%  and  a  standard  deviation  of  11.1%  (see  figure  1).  On  average 
the  triplets  only  tolerated  faults  38%  of  the  time  that  they  caused  a  failure.  This 
can  be  explained  by  the  large  number  of  correlated  failures  that  occurred. 
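
The two measures of triplet variability used in this section can be sketched as
follows. The record layout (one entry per observed version failure, tagged with
the responsible fault and whether the vote still produced the correct answer)
and the sample data are hypothetical; they only illustrate why the first measure
can be much higher than the second.

```python
def triplet_metrics(failure_events):
    """Compute both variability measures for one triplet.

    failure_events: list of (fault_id, masked) pairs, one per observed version
    failure; masked is True when the vote still produced the correct answer.
    Returns (fraction of faults tolerated at least once,
             conditional probability of masking given a failure).
    """
    if not failure_events:
        raise ValueError("no failures observed for this triplet")
    faults_seen = {fault for fault, _ in failure_events}
    faults_tolerated = {fault for fault, masked in failure_events if masked}
    fraction_tolerated_once = len(faults_tolerated) / len(faults_seen)
    masked_count = sum(1 for _, masked in failure_events if masked)
    p_mask_given_failure = masked_count / len(failure_events)
    return fraction_tolerated_once, p_mask_given_failure


# Hypothetical data: fault f1 fails three times and is masked only once.
events = [("f1", True), ("f1", False), ("f1", False), ("f2", False), ("f3", True)]
fraction, p_mask = triplet_metrics(events)
```

In this made-up example two of the three faults are tolerated at least once,
yet only two of the five failures are masked.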

³ Note that these are percentages of the faults present in the three versions that constitute the
voting triplet and not percentages of all faults found in all versions.

Figure 1: Conditional Probability of Triplet Fault Masking

4.2      Comparison of Fault-Detection Techniques

Some comparison of the fault-detection techniques is possible with this data, although
absolute numbers may not be important because of the problems of evaluating
and keeping constant the amount of effort put into each technique. Furthermore,
numbers are not really the issue; instead a more important question may be
whether different or similar faults are found by each technique. A technique may
only find one fault, but if that fault is not likely to be found in any other way, then
that technique may still need to be applied.

4.2.1      Definition  of  Terms 

Before  discussing  the  results,  certain  key  concepts  need  to  be  clearly  defined.  In 
particular,  it  is  important  to  understand  what  a  fault  is  and  when  it  may  be  detected 
by  each  technique.  The  IEEE  standard  defines  a  fault  to  be  "an  accidental  condition 
that causes a functional unit to fail to perform its required function." [1]

If the correction of a section of code eliminates at least one failure, it is counted
as a single fault. Several faults could contribute to the failure for a given data set,
and several failures could be due to a single fault. For example, a single data set
could reveal separate faults dealing with battalion location and observation of one
battalion by another. If either of these faults were solely responsible for failures in
other data sets, they are counted as separate faults. If correction of either of these
faults eliminates the failure, they are counted as a single fault. This is because
faults may sometimes be due to actions distributed through the version code. For
example, if a failure results from the initialization code not ensuring an assumption
made in some calculation code, this is counted as only one fault, although it could
be corrected either by changing the initialization code to ensure the assumption or
by changing the calculation code to obviate the assumption.

For  most  of  the  techniques  used  in  this  experiment,  determination  of  when  the 
techniques  detected  faults  is  straightforward.  Static  analysis  and  code  reading  iden- 
tify the  fault  precisely  to  the  section  of  code  in  which  it  occurs.  A  run-time  assertion 
generates  reports  when  faults  produce  an  erroneous  run-time  state.  Testing  detects 
a  fault  when  the  test-failure  conditions  are  satisfied  due  to  behavior  caused  by  a 
fault.  The  test-failure  conditions  are  explicitly  given  in  the  functional  test  plans 
and  are  developed  as  part  of  the  test  data  during  the  structural  testing.  Faults 
detected  during  analysis  of  the  version  source  code  to  formulate  the  structural  test 
data  are  also  considered  detected  by  structural  testing. 

For back-to-back testing, the conditions when a fault is detected are less clear.
We define a fault as detected if the version containing it is identified as erroneous
because its answer differs from a majority of the versions. For example, if a triplet
produces Good-Good-Bad results, then the fault responsible for producing the Bad
output will be counted as detected. However, Bad-Bad-Good (where the two Bad
results are identical) will not count as a fault detection because a fault has not
been identified in the incorrect version, but instead the correct version has been
erroneously identified as containing a fault. That is, a fault is counted as detected
if the voting process isolates the version that is faulty. As another example,
Bad1-Bad1-Bad2 would count as only one fault detection (the fault responsible for the
Bad2 output) while Bad1-Bad2-Bad3 would count as three faults detected since all
three versions would need to be debugged.
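
The counting rule for three-version back-to-back testing can be sketched as
below. This is an illustration under stated assumptions: outputs are compared by
exact equality, a version index stands in for "the fault responsible for that
output", and the `correct` argument is a hypothetical reference used only to
score the vote afterwards (the vote itself has no oracle).

```python
from collections import Counter


def flagged_versions(outputs):
    """Indices of versions that the vote singles out for debugging."""
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return []                      # unanimous: nothing is flagged
    if count == 2:
        return [i for i, o in enumerate(outputs) if o != value]
    return [0, 1, 2]                   # three distinct outputs: debug all three


def faults_detected(outputs, correct):
    """Flagged versions that are actually faulty, i.e., the counted detections."""
    return [i for i in flagged_versions(outputs) if outputs[i] != correct]
```

For the four patterns named in the text, `faults_detected` yields one detection
for Good-Good-Bad, none for Bad-Bad-Good with identical Bad results, one for
Bad1-Bad1-Bad2, and three for Bad1-Bad2-Bad3.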

It could be argued that Bad1-Bad1-Bad2 would actually result in finding two
or three faults because in trying to fix the Bad2 fault, the Bad1 fault would eventually
be found. However, we feel that once a fault is located and fixed in the Bad2
version, the debuggers would probably stop. The same type of argument could be
made in the Bad-Bad-Good case where eventually the debuggers might stumble
onto the faults causing the Bad results when they gave up attempting to fix the
Good program. This seems to us to be overly optimistic. The tendency will be to
try to get the single answer to match the multiple answer rather than vice versa.
In fact, there were occasions when we temporarily "broke" a correct version when
trying to debug it and get it to match the majority result. We felt that to count a
fault detection technique as detecting a fault, it should at least identify the program
that has failed. This decision is arbitrary, but seemed to make the most sense to
the authors.

Two-version voting detects faults if the pair disagree as to the result. Good-Bad
is counted as detecting the fault that caused the Bad answer, and Bad1-Bad2 is
counted as detecting the two faults that caused the two distinct bad results. One
side effect of this decision is that all faults detected by three-version voting are also
(by definition) detected by two-version voting, but the converse is not true.
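
The subset relationship stated above can be checked exhaustively on a small
model. The sketch below is an assumption-laden illustration: outputs are symbolic
labels compared by exact equality, a version index stands in for the fault it
contains, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def triplet_detections(outputs, correct):
    """Faulty versions isolated by the three-version rule (output unique in the triplet)."""
    counts = Counter(outputs)
    return {i for i, o in enumerate(outputs) if counts[o] == 1 and o != correct}


def pair_detections(outputs, correct):
    """Faulty versions detected by at least one disagreeing pair drawn from the triplet."""
    found = set()
    for i, j in combinations(range(len(outputs)), 2):
        if outputs[i] != outputs[j]:
            found.update(k for k in (i, j) if outputs[k] != correct)
    return found


labels = ["good", "bad1", "bad2", "bad3"]

# Every detection made by three-version voting is also made by some pair.
subset_holds = all(
    triplet_detections(t, "good") <= pair_detections(t, "good")
    for t in product(labels, repeat=3)
)

# The converse fails for Bad1-Bad1-Good: pairs detect both faulty versions,
# while the triplet vote isolates neither of them.
example = ("bad1", "bad1", "good")
only_by_pairs = pair_detections(example, "good") - triplet_detections(example, "good")
```

The check passes because a faulty version whose output is unique in the triplet
necessarily disagrees with each of its two partners.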

We assumed that many more test cases can be executed when using back-to-back
testing with random generation of test cases, because there is no need to apply an
independent validation procedure to the outputs. There is, however, a need to write
a test-harness program to implement the voting. Writing a test
harness is not necessarily a trivial problem when real numbers are involved since
different correct results are possible from different (correct) algorithms due to the
use of limited-precision arithmetic. Using a tolerance for the comparisons will not
solve the problem [9]. In previous experiments, this consistent comparison problem
has resulted in time-consuming debugging of correct programs.
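
One simple way to see why a tolerance is awkward as a voting rule is that
tolerance-based agreement is not transitive. The sketch below is only an
illustration of that single point and is not the argument of [9]; the tolerance
and the three output values are arbitrary assumptions.

```python
def agree(a, b, tol=0.1):
    """Tolerance-based 'equality' between two real-valued outputs."""
    return abs(a - b) <= tol


# Three hypothetical version outputs for the same input.
x, y, z = 1.00, 1.06, 1.12

pairwise = (agree(x, y), agree(y, z), agree(x, z))
```

Here the first and second outputs agree, the second and third agree, but the
first and third do not, so a harness built on `agree` has no unambiguous
majority to report.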

The reader should note that we are now reinterpreting our experimental procedures.
In the previous section, we identified the execution of the 10,000 input cases
as a simulation of the production use of the software. We are now interpreting this
procedure as a fault-elimination technique that would precede the actual production
use of the programs. There is no problem with this from a practical standpoint
since the procedures are identical and differ only in the time they are performed,
but it may be confusing to the reader.

4.2.2      Fault Detection Summary

Table 4 shows the number of faults detected by each technique⁴. The first five lines
(those marked 'only') give the number of faults detected by each technique that
were detected by none of the other techniques (e.g., run-time assertions detected
a total of 23 faults that were not detected by voting, testing, static analysis or
code reading). The remainder of the table gives the number of faults detected in
common by the techniques named on each line (e.g., code reading found a total of 4
faults that were also found by run-time assertions, but were not found by any other
technique).

The voting technique used in constructing table 4 was two-version voting.
Three-version voting detected 112 of the 123 faults found by two-version voting. These
faults were found by voting with the eight versions combined into the 28 possible
pairs and the 56 possible triplets. The values in the line marked '2-Version voting
only' in table 4 are the minimum and maximum numbers of faults detected by the
two-version voting pairs for each version. A range exists for voting because it was applied
to each version 7 times (the number of two-version voting systems in which each
version participated) while the other techniques were only applied once. In all other
voting cases (voting in combination with each of the other techniques), there were
no variations in the number of faults detected. The interesting feature of table 4 is
not the precise values shown (which depend on the application), but that most of
the faults detected by each technique were found by no other technique.
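
The counts of voting systems quoted above follow directly from combining eight
versions. The short sketch below reproduces them; the variable names are
illustrative only.

```python
from itertools import combinations
from math import comb

versions = range(1, 9)                       # the eight program versions

pairs = list(combinations(versions, 2))      # all two-version voting systems
triplets = list(combinations(versions, 3))   # all three-version voting systems

# Systems in which one particular version (here version 1) participates.
pairs_with_v1 = [p for p in pairs if 1 in p]
triplets_with_v1 = [t for t in triplets if 1 in t]
```

Each version therefore takes part in 7 of the 28 pairs, which is why the
voting line of table 4 reports a range rather than a single value.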


⁴ Note that the figures in table 3 and table 4 are not comparable due to: a) the difference
between fault detection by voting and fault tolerance by voting and b) the difference between
2-version and 3-version voting.

                                                        Version
Method                                  1      2      3      4      5      6      7      8    Total
Testing only                            2     12     21     11     13      1     11     10      81
2-Version voting only               10-11      9  11-12    7-8     14    7-8    5-6     10   73-78
Code reading only                       0      2      4      2      0      1      0     16      25
Assertions only                         3      3      1      8      1      1      3      3      23
Static analysis only                    0      0      2      0      0      0      0      0       2
Both 2-v. voting & test                 3      1      1      3      2      6      4      0      20
Both assertions & test                  5      1      0      2      5      3      0      0      16
Both assertions & 2-v. voting           0      0      2      0      2      0      4      4      12
Both reading & assertions               2      0      0      1      1      0      0      0       4
Both static analysis & 2-v. voting      0      0      0      0      0      1      1      0       2
Both reading & 2-v. voting              0      0      2      0      0      0      0      0       2
Both reading & test                     0      0      0      0      0      0      1      0       1
Both static analysis & test             0      0      0      0      0      0      1      0       1
Assert & 2-v. voting & test             0      0      1      1      2      1      0      0       5
Reading & 2-v. voting & test            0      2      0      0      0      0      0      2       4

Table 4: Number of Faults Detected

4.2.3      Variation in Voting Performance

To  give  some  feeling  about  the  variability  of  the  voting  performance,  two  sets  of 
statistics  are  provided.  The  first  set  is  the  number  of  faults  detected  at  least  once 
by  each  pair  and  each  triplet,  divided  by  the  total  number  of  faults  that  caused  at 
least  one  failure  in  the  versions  making  up  the  system  (i.e.,  the  fraction  of  revealed 
faults  that  each  voting  system  detected).  For  the  28  two-version  voting  systems, 
the  fraction  of  faults  detected  varies  from  91.3%  to  100%  with  a  mean  of  97.9%  and 
a  standard  deviation  of  2.6%.  For  the  56  three-version  voting  systems,  the  fraction 
of  faults  detected  varies  among  the  triplets  from  90.5%  to  100%  with  a  mean  of 
96.5%  and  a  standard  deviation  of  2.5%.  In  short,  the  majority  of  pairs  and  triplets 
fail  to  detect  at  least  some  of  the  faults  revealed  by  the  input  data.  Since  virtually 
all  systems  would  be  developed  with  at  most  one  pair  or  triplet,  the  data  in  table  4 
represents  a  best  case  for  voting. 

The second set of statistics used to analyze the variation in fault detection is
the conditional probability that a pair or triplet detects each fault given that it
is revealed. The mean of these probabilities over all faults detected for each
two-version system varies between 0.826 and 0.981, with a mean of 0.934 and a standard
deviation of 0.040 (see figure 2). In other words, voting with a two-version system
never detected all of the failures of the component versions. Voting with the majority
of two-version systems failed to detect 5% or more of the version failures. For
the three-version voting systems, the mean of the conditional probabilities over all
faults detected by each system varied from 0.787 to 0.953 with a mean of 0.886
(see figure 3). This indicates that there is a considerable chance that three-version
back-to-back testing will fail to identify erroneous versions. On the average, this
happened during this experiment in one out of every nine version failures
(1 - 0.886 = 0.114).

Figure 2: Two-Version Conditional Probability of Detection

Figure 3: Three-Version Conditional Probability of Detection

4.2.4      Back-To-Back Testing

There are two particularly interesting comparisons to make that deal with currently
unresolved issues in testing research. The first is the use of back-to-back testing
vs. the use of other testing oracles (i.e., those not involving a voting procedure).
Back-to-back testing allows a large amount of data to be executed due to the automated
nature of the oracle, and it has been advocated as a way of extensively
testing complex software where determining a correct answer by a non-voting procedure
may be tedious and time-consuming[7, 8, 28, 31]. Of course, if one takes a
larger perspective, part or all of the savings in testing may be offset by the cost
of producing multiple versions of the software. However, if back-to-back testing is
much more effective, then the cost arguments may be irrelevant.

There were 78 faults that were detected by the voting procedure that were not
detected by any other technique. Better implementation of the other testing techniques
or the use of different testing techniques might have found more faults, but
this would have to be proven either way. Even given the novice nature of the participants
in the other testing procedures, they did find 153 faults that were not
detected by the back-to-back testing. Forty-five faults were detected in common. There
were faults that did not cause failures on the randomly-generated test data and
therefore could not possibly have been detected by the back-to-back testing, but
were found by the techniques that do not require failure to detect faults.
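
The three counts in the preceding paragraph can be combined as a quick
consistency check against the totals reported elsewhere in this section. The
sketch below uses only those counts; the variable names are illustrative, and 78
is the upper end of the range reported for voting in table 4.

```python
voting_only = 78      # detected by back-to-back (two-version) voting alone
others_only = 153     # detected by the other techniques but not by voting
in_common = 45        # detected by voting and by at least one other technique

voting_total = voting_only + in_common            # all faults found by voting
all_faults = voting_only + others_only + in_common

# Share of all detected faults that back-to-back testing did not find.
share_missed_by_voting = others_only / all_faults
```

The voting total agrees with the 123 faults attributed to two-version voting,
and more than half of all detected faults were not found by back-to-back
testing.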

Our  data  suggests  that  using  back-to-back  testing  on  randomly-generated  data  is 
not  an  acceptable  testing  procedure  by  itself.  A  related  question  is  whether  better 
results  are  obtained  by  doing  the  back-to-back  testing  on  both  randomly-generated 
test  cases  and  functionally-generated  test  cases.  This  separates  the  issue  of  test 
data  generation  from  the  issue  of  using  voting  as  a  test  oracle.  We  executed  the  56 
triplets  on  the  functionally-generated  and  structurally-generated  test  cases  and  did 
not  detect  any  additional  faults.  This  implies  that  the  problem  is  not  necessarily  in 
the  test  case  generation  method,  but  in  the  identification  of  errors  by  voting,  i.e., 
by  the  limitations  of  using  voting  as  a  test  oracle. 
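The limitation of voting as a test oracle can be illustrated with a small sketch. The Python code below is not the voter used in the experiment; the function names, the three toy versions and their faults are hypothetical and are chosen only to contrast an independent fault with a correlated one.

```python
from collections import Counter

def vote(outputs):
    """Majority vote: return (majority_value, indices_of_dissenters).

    majority_value is None when no value is produced by more than
    half of the versions.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters

# Hypothetical versions of an absolute-value routine.
def correct(x):
    return -x if x < 0 else x

def missing_branch(x):
    return x                       # negative inputs are not handled

def wrong_at_zero(x):
    return 1 if x == 0 else abs(x)  # independent fault at x == 0

def back_to_back(versions, test_input):
    """Run every version on one input and vote on the outputs."""
    return vote([v(test_input) for v in versions])

# Independent fault: one version disagrees, so the voter flags it.
independent = back_to_back([correct, correct, wrong_at_zero], 0)
# Correlated fault: all versions fail identically, so nothing is flagged.
correlated = back_to_back([missing_branch, missing_branch, missing_branch], -3)

print(independent)  # (0, [2])
print(correlated)   # (-3, [])
```

In the second case every version fails on the input, yet the voter reports agreement. Changing how the test input is generated does not help, because the failure is concealed by the oracle rather than by the test data.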

4.2.5     Types  of  Faults  Detected 

To  examine  the  fault  detection  behavior  of  the  techniques  further,  the  types  of  faults 
detected by each were profiled and compared. Because there is no widely accepted
detailed taxonomy for fault classification, a 13-class fault taxonomy was developed
and used. This fault taxonomy is described in table 5. Since this taxonomy was
developed  to  differentiate  between  the  sets  of  actual  faults  detected  by  the  various 
techniques  (i.e.,  developed  after  examining  the  faults  in  an  attempt  to  distinguish 
them),  significance  tests  on  these  classifications  are  inappropriate.  In  the  following 
discussion,  a  statement  that  a  technique  detected  faults  in  a  particular  class  does  not 
imply  that  the  technique  detected  all  faults  of  that  class.  For  example,  the  statement 
that  voting  detected  missing-thread  faults  should  not  be  interpreted  as  indicating 
that  all  missing-thread  faults  located  in  the  versions  were  detected  by  voting.  Two 
of  the  categories  in  this  fault  taxonomy  (Overrestriction  faults  and  Data-Structure 
faults)  are  not  mentioned  in  the  text  that  follows  since  these  categories  do  not 
provide  a  basis  for  characterizing  the  differences  observed  in  fault  detection  by  the 
various  techniques. 
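As a reading aid, the Detecting Technique column of table 5 can be transcribed into a small lookup structure and queried. The Python sketch below is only an illustration: the dictionary layout and helper names are ours, the abbreviations follow the table (with Stat standing for static data-flow analysis), and, as noted above, an entry means only that the technique detected some fault of that class.

```python
# Detecting techniques per fault class, transcribed from table 5.
TAXONOMY = {
    "Overrestriction": {"Assert", "Read", "Test", "Vote"},
    "Loop Condition": {"Vote", "Assert", "Test"},
    "Calculation": {"Read"},
    "Initialization": {"Stat", "Test"},
    "Substitution": {"Vote", "Assert"},
    "Missing Check": {"Read"},
    "Branch Condition": {"Vote", "Read", "Test"},
    "Missing Branch": {"Read", "Test"},
    "Missing Thread": {"Vote", "Test"},
    "Unimplemented Requirement": {"Test"},
    "Ordering": {"Vote", "Test"},
    "Parameter Reversal": {"Vote", "Assert"},
    "Data Structure": {"Vote", "Test", "Read", "Assert"},
}

def classes_detected_by(technique):
    """Fault classes in which the technique detected at least one fault."""
    return sorted(c for c, techs in TAXONOMY.items() if technique in techs)

def classes_with_single_technique():
    """Fault classes listed with exactly one detecting technique."""
    return {c: next(iter(techs)) for c, techs in TAXONOMY.items()
            if len(techs) == 1}

print(classes_detected_by("Stat"))
# ['Initialization']
print(classes_with_single_technique())
# {'Calculation': 'Read', 'Missing Check': 'Read', 'Unimplemented Requirement': 'Test'}
```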

Code Reading by Stepwise Abstraction

Code  reading  by  stepwise  abstraction  on  uncommented  code  found  calculation 
faults,  missing-check  faults,  branch-condition  faults  and  missing-branch  faults.  The 
participants  did  not  find  large  global  pieces  of  missing  code  or  missing  threads  of 
logic  that  ran  through  the  entire  program. 

Analysis  of  the  experiment  data  lent  insight  into  two  questions  related  to  the 
use  of  code  reading  in  software  development.  The  first  of  these  questions  is  what 
conditions  lead  code  reading  to  fail  to  detect  faults.  While  code  reading  detected 
a  number  of  faults,  it  failed  to  detect  the  majority  of  faults.  Identification  of  the 
causes  of  this  non-detection  suggests  ways  to  improve  code  reading  to  broaden  its 
effectiveness. One reason for the failure of code reading to detect certain faults was
the omission of needed detail in the abstractions constructed by the participants.
Another condition that seemed difficult to detect was missing code. The
code-reading participants detected missing-branch faults (i.e., where the set of cases
handled by the code did not cover all possible cases at a specific point), but failed
to detect those cases where larger or more widespread code was omitted.

Class                      Comments                            Detecting Technique

Overrestriction            E.g., forcing all weather to move   Assert, Read, Test, Vote
                           north-east, rejecting legal input
Loop Condition             E.g., infinite loop                 Vote, Assert, Test
Calculation                Incorrect formula                   Read
Initialization             Variable not initialized            Stat. Analysis, Test
Substitution               Wrong variable used                 Vote, Assert
Missing Check              Exceptional case not handled,       Read
                           e.g., divide by zero
Branch Condition           Bad condition on a branch           Vote, Read, Test
Missing Branch             Localized missing code to detect    Read, Test
                           and handle specific conditions
                           in normal execution
Missing Thread             Missing path throughout program     Vote, Test
Unimplemented Requirement  Missing functionality on all paths  Test
Ordering                   Operations in wrong order           Vote, Test
                           (e.g., updating value before use)
Parameter Reversal         Actual parameter order permuted     Vote, Assert
                           with respect to formal parameter
Data Structure             E.g., linked list becomes circular  Vote, Test, Read, Assert

Table 5: Fault Taxonomy

A second question of interest in examining the code-reading results is what
conditions lead the code-reading participants to erroneously report code as faulty. These
erroneous reports seemed to occur when the readers were misled by the code.
Preventing such misdirection may help to improve the detection performance of code
reading. Use of commented code for the code reading might have prevented some of
this misdirection of the readers; however, it might also have led to different types
of misdirection. For example, it is equally possible that the readers would report
code as erroneous when it conflicted with the comments (e.g., where the
development participants corrected the code, but not the comments describing the code).
Another  possibility  is  that  the  readers  would  have  failed  to  detect  as  many  faults, 
e.g.,  if  they  summarized  the  comments  instead  of  the  code  and  thus  duplicated 
the  faulty  assumptions  made  by  the  development  participants.  In  an  effort  to  avoid 
these  problems,  as  well  as  the  inherent  inequality  of  information  based  on  the  widely 
varying  amounts  of  comments  in  the  version  source  code  (see  table  1),  all  comments 
were  stripped  from  the  versions  before  the  readers  were  given  them. 

Analysis  of  the  annotations  written  by  the  code  readers  indicates  that  false  alarms 
arose  from  code  that  was  difficult  to  abstract.  For  example,  an  erroneous  fault  report 
was  generated  for  a  procedure  with  a  large  number  of  arguments  that  was  called 
in several places in the code. Some of the formal parameters were used in different
ways in the procedure (depending on the value of other formal parameters), and
this led to misconceptions on the part of the reader. False alarms also occurred
when abandoned implementation strategies (blind alleys during development) were
reflected in the code. For example, the readers erroneously reported several faults in
cases where the name of a variable conflicted with the manner in which the variable was
used.  Annotations  by  the  code  readers  in  such  situations  indicate  that  they  focused 
on  syntactic  factors  rather  than  the  program  semantics.  These  results  suggest  that 
experience  and  improved  training  may  help  reduce  erroneous  reports  from  code 
reading. 

Static Data-Flow Analysis

Static  data-flow  analysis  found  only  initialization  faults.  Three  faults  were  found 
by  this  technique  that  were  not  detected  by  any  other.  Upon  examination,  it  was 
determined  that  the  compiler  and  operating  system  versions  being  used  happened  to 
initialize  to  zero  the  particular  storage  locations  where  the  programs  were  loaded, 
and  these  variables  were  used  for  counters  and  needed  to  be  initialized  to  zero. 
Obviously,  this  cannot  be  counted  on  in  future  versions  of  these  support  programs 
so  these  are  real  and  important  faults  to  detect. 
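The kind of anomaly reported here can be illustrated with a deliberately simplified sketch. The Python code below is not the analysis tool used in the experiment: it handles only straight-line code, whereas a real data-flow analyzer works over the control-flow graph, and the statement representation and names are hypothetical.

```python
def undefined_references(statements, inputs=()):
    """Report (statement_number, variable) for each use before definition.

    Each statement is a pair (target, operands): the variable assigned
    and the variables read by the right-hand side. Only straight-line
    code is handled in this sketch.
    """
    defined = set(inputs)
    anomalies = []
    for number, (target, operands) in enumerate(statements, start=1):
        for name in operands:
            if name not in defined:
                anomalies.append((number, name))
        defined.add(target)
    return anomalies

# Hypothetical fragment: a counter is incremented without being set to zero.
program = [
    ("total", []),                 # total := 0
    ("count", ["count"]),          # count := count + 1  (count never initialized)
    ("mean", ["total", "count"]),  # mean := total / count
]

anomalies = undefined_references(program)
print(anomalies)  # [(2, 'count')]
```

A fault of this kind causes no failure when the storage happens to contain zero, so techniques that depend on observing a failure cannot reveal it, while the static check reports it regardless of the run-time environment.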

Voting

Voting found missing-thread faults, parameter-reversal faults, substitution faults,
ordering faults and faults (subsets of loop-condition and data-structure faults)
causing abends (which, despite their cause, are obviously found by any of the techniques
that involve executing the code over a large number of test cases).

It  is  interesting  to  consider  the  faults  that  were  not  found  by  voting,  i.e.,  those 
that  were  so  highly  correlated  that  the  faults  were  masked  by  the  voting  procedure. 
For the most part, these were missing-branch faults. This is consistent with past
experiments, which have all reported that missing-logic errors are poorly tolerated by
multi-version  systems.  Testing  strategies,  such  as  functional  and  structural  testing, 
that  examine  special  cases  as  well  as  typical  cases  were  more  successful  at  finding 
missing-branch  faults.  As  discussed  above,  performing  back-to-back  testing  on  the 
test  cases  derived  for  functional  testing  did  not  solve  the  problem  since  the  common 
faults  masked  the  identification  of  the  fault  even  though  the  programs  failed. 

Another  unmasked  fault  involved  the  use  of  a  wrong  subscript.  This  is  puzzling 
as  the  same  thing  happened  in  a  previous  experiment  [10].    We  cannot  currently 
find  any  other  explanation  aside  from  coincidence  although  we  are  in  the  process  of 
attempting  to  determine  if  an  explanation  exists. 

Run-Time Assertions

Run-time assertions found parameter-reversal faults, substitution faults and
faults causing abends. They did not detect any of the four classes of
missing-code faults. We are not very confident about the data for run-time assertions
as the programmers involved did not have any experience in writing exception
or error-detection code, and our subjective evaluation of their assertions is that
they were, in general, quite poor. All of the run-time assertions used were
simple range or specific value tests (e.g., run-time assertions to check if the variable
Params.NumWeatherEvents lies between 0 and the constant MaxWeather, or if
pointers are non-nil). In fact, examination of the design documents shows cases where the
development  participants  anticipated  faults  that  actually  occurred  in  their  code, 
but  (for  reasons  known  only  to  them)  they  omitted  assertions  to  check  for  these 
faults. 

Each  pair  of  development  participants  added  assertions  to  their  version  to  check 
if  battalions  left  the  simulated  battlefield,  and  these  assertions  detected  the  errors 
that  were  generated  when  this  occurred.  Two  of  the  eight  versions  placed  range 
restrictions  on  internally-calculated  values,  and  further  faults  were  detected  by  these 
assertions  in  each  of  these  versions. 
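The assertions described above are simple range checks. The Python sketch below shows their general shape only; it is not the participants' code, and the constant values, the battlefield dimensions and the function names are hypothetical.

```python
MAX_WEATHER = 20                       # hypothetical value of MaxWeather
FIELD_WIDTH, FIELD_HEIGHT = 100, 100   # hypothetical battlefield size

class AssertionViolation(Exception):
    """Raised when a run-time assertion fails."""

def require(condition, message):
    # Explicit check rather than the `assert` statement, which Python
    # skips when run with the -O option.
    if not condition:
        raise AssertionViolation(message)

def check_weather_count(num_weather_events):
    """Range check on an input parameter."""
    require(0 <= num_weather_events <= MAX_WEATHER,
            "NumWeatherEvents out of range")

def check_on_battlefield(x, y):
    """Range check on a battalion position."""
    require(0 <= x < FIELD_WIDTH and 0 <= y < FIELD_HEIGHT,
            "battalion left the battlefield")

def move_battalion(x, y, dx, dy):
    """Move a battalion and assert on the internally calculated position."""
    new_x, new_y = x + dx, y + dy
    check_on_battlefield(new_x, new_y)
    return new_x, new_y
```

With these definitions, `move_battalion(99, 50, 5, 0)` raises `AssertionViolation`, while `move_battalion(10, 10, 1, 1)` returns `(11, 11)`. The check fires only at the point where it is placed, which is why the placement of assertions on internally calculated values matters.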

The  simple  check  strategy  used  and  the  failure  by  the  majority  of  the  development 
teams  to  place  assertions  to  check  internal  results  left  many  errors  undetected. 
However, despite these weaknesses, simple range checks detected faults. These checks
detected  23  faults  that  were  found  in  no  other  manner.  The  fact  that  assertions 
detected  faults  that  voting  did  not  is  consistent  with  the  results  of  our  previous 
study  of  assertion  effectiveness [24].  It  appears  that  even  a  cursory  set  of  assertions 
has  some  value,  and  this  suggests  that  it  would  be  useful  to  perform  further  work  to 
examine  the  effectiveness  of  a  thorough  set  of  assertions  for  fault  detection  (possibly 
using  an  automated  strategy). 

Functional and Structural Testing

Functional  and  structural  testing  identified  ordering  faults,  missing-branch  faults, 
unimplemented-requirement faults and missing-thread faults. They also detected
faults causing abends (as did all the techniques that involved executing the
programs). Structural testing detected further missing-code faults, in particular faults
involving  variables  that  were  initialized  in  a  manner  that  in  rare  cases  conflicted 
with  the  manner  in  which  those  variables  were  used. 

Structural  testing  failed  to  detect  several  missing-thread  faults  that  were  found 
by  other  techniques  (such  as  voting).  The  incompleteness  seemed  largely  due  to 
the module-by-module nature of the testing tool used. That is, the prototype
version  of  ASSET  used  in  this  experiment  measures  the  coverage  achieved  by  the 
input  data  on  each  module  individually,  with  no  consideration  of  data  flow  between 
modules.  The  versions  contain  several  instances  where  global  data  structures  are 
initialized  in  one  module,  updated  in  a  second  module  and  used  in  calculations  in 
several  other  modules.  While  all  of  the  initialization  paths  and  all  of  the  update 
paths  are  covered  by  the  test  data,  not  all  of  the  update  paths  are  covered  for  each 
initialization  path.  Therefore,  several  of  the  missing-code  faults  eluded  detection  in 
our  structural  testing. 
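The gap between per-module coverage and coverage of path combinations can be shown with a small sketch. The Python code below does not model ASSET; the path names and the test set are hypothetical and serve only to show how full coverage within each module can coexist with uncovered initialization-update combinations.

```python
from itertools import product

# Hypothetical paths in an initialization module and an update module.
init_paths = {"init_A", "init_B"}
update_paths = {"upd_1", "upd_2"}

# Each test exercises one initialization path and one update path.
tests = [("init_A", "upd_1"), ("init_B", "upd_2")]

# Module-by-module view: is every path in each module covered?
covered_inits = {init for init, _ in tests}
covered_updates = {upd for _, upd in tests}
per_module_complete = (covered_inits == init_paths
                       and covered_updates == update_paths)

# Inter-module view: which initialization-update pairs were never run?
all_pairs = set(product(init_paths, update_paths))
uncovered_pairs = sorted(all_pairs - set(tests))

print(per_module_complete)  # True
print(uncovered_pairs)      # [('init_A', 'upd_2'), ('init_B', 'upd_1')]
```

A fault that appears only when a particular initialization is followed by a particular update would lie in one of the uncovered pairs, even though the tool reports complete coverage for both modules.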

Examination  of  the  functionally-specified  test  data  sets  showed  that  faults  were 
revealed  only  by  those  data  sets  that  contained  atypical  data  (i.e.,  those  tests  that 
exercised special cases in the versions or odd combinations of the functions
supported by the code). This result supports a recommended practice in the field of
software  testing. 

Additional Comparisons of Fault Detection Techniques

Two general attributes accounted for much of the observed variation of
effectiveness: the ability of the techniques to examine internal states and the scope of their
evaluation.

One reason voting failed to detect some faults was that it was not able to examine
internal  program  states.  The  other  techniques  do  not  share  this  limitation.  For  code 
reading  and  static  analysis,  examination  of  the  internal  state  involves  evaluation 
of  the  program  source  code.  Functional  testing  identifies  and  evaluates  internal 
abstract  functions.  Assertions  evaluate  specific  internal  conditions  at  the  locations 
where  they  are  inserted.  Because  the  voting  systems  examine  only  final  states,  they 
fail  to  identify  faults  that  occur,  but  are  concealed  by  later  processing. 

Tso  and  others  argue  that  voting  may  be  performed  on  internal  program  states, 
as in the cross-check analysis technique [32]. However, the programs in this
experiment are quite diverse. The internal program states differ significantly in the
algorithms  and  data  structures  employed.  A  single  value  in  the  internal  state  of 
one  program  may  indeed  be  a  single  value  in  another  program,  but  more  often  it  is 
either  a  function  of  several  values  or  not  present  at  all  (unneeded  in  the  alternate 
algorithm used in the second program). Furthermore, since the order of the
programs' operations is also quite diverse, there is no single time except initialization
and production of the final result at which any correspondence in values could be
compared by voting. To allow voting on internal program states requires
specification of the algorithm and data structures used in the internal states, effectively
eliminating  any  significant  design  diversity  and  thus  eliminating  the  ability  to  detect 
design  errors. 

A  second  important  attribute  is  the  scope  of  evaluation.  The  scope  over  which 
assertions and code reading examine the system state appears to be the key
characteristic limiting the detection of faults by those techniques. Assertions examine
the system state at specific points in the execution. If a fault has not yet occurred
at those points, or if the fault's effect is masked, the assertion does not detect the
fault.  The  use  of  code  reading  by  stepwise  abstraction  on  the  uncommented  form 
of  these  programs  did  not  detect  certain  faults  because  the  process  of  abstraction 
did  not  maintain  sufficient  detail.  For  example,  an  assumption  by  the  coder  might 
be  violated  by  the  faulty  code,  but  the  assumption  is  not  preserved  through  several 
layers of abstractions made by the reader between the initialization and the
calculation code. The purpose of abstraction is to keep the amount of information at a
manageable  level,  but  over-abstraction  limits  the  effectiveness  of  code  reading. 

5      Conclusions 

It  is  important  to  consider  several  caveats  when  drawing  conclusions  from  the  data 
presented  in  this  paper.  First,  experts  in  the  various  techniques  were  not  used. 
Students  get  a  lot  of  experience  in  programming  while  in  school,  but  they  seldom 
receive  adequate  exposure  to  and  practice  with  testing  and  other  fault-elimination 
techniques.  We  gave  them  training,  but  that  is  not  a  substitute  for  experience. 
Furthermore,  only  one  method  was  applied  within  each  category  of  fault-elimination 
techniques;  the  particular  method  chosen  may  not  have  been  the  most  effective. 
Finally,  our  program  may  not  be  representative  of  a  large  number  of  applications 
and  the  particular  software  development  procedures  also  may  not  be  representative. 
Despite these limitations (which unfortunately are inherent in this type of
experimentation), useful information can be derived from this study. In the few instances
where there is other experimental evidence, our results tend to support and confirm
previous findings. Where almost no experimental evidence is available, our results
represent one data point that can be used to focus and direct future experiments.

This experiment is the first to investigate the relationship between
fault-elimination techniques and software fault tolerance. We found that our data does not
support the hypotheses that multi-version voting is a substitute for functional
testing, that testing can be reduced when using this software fault-tolerance technique,
or that testing can proceed in conjunction with operational use of the software
in an n-version programming system where high reliability is required. Instead,
we  found  that  multi-version  voting  did  not  tolerate  most  of  the  faults  detected  by 
the  fault-elimination  techniques.  Although  we  also  found  that  multi-version  voting 
tolerated  different  faults  than  were  detected  by  the  fault-elimination  techniques,  no 
firm  conclusions  should  be  drawn  from  this  because  of  doubts  about  the  ability  of 
the  novices  involved  and  the  limitations  of  the  fault  elimination  techniques  used; 
further  investigation  is  suggested. 

This experiment examines a broad set of fault detection techniques in a
comparative manner. While the presence of multiple versions can speed the execution of
large numbers of randomly generated cases, our results cast doubt on the
effectiveness of using voting as a test oracle. Testing procedures that allow instrumenting
the  code  to  examine  internal  states  were  much  more  effective.  When  comparing 
fault-elimination  methods,  we  found  that  the  intersection  of  the  sets  of  faults  found 
by each method was relatively small. Examination of the faults allowed us to
categorize the types found by each method and, in some cases, to explain why these
results  occurred. 

This experiment raises questions with respect to several of the techniques
examined. The detection capability of code reading appears to be reduced in comparison
to earlier results such as those reported by Basili and Selby [6] (who used smaller
programs). Additional research is needed to distinguish the effects of program size
and complexity on the effectiveness of code reading. Analysis of the faults not
detected shows that there is a need to develop extensions to code reading techniques
that  better  characterize  global  effects.  One  way  of  accomplishing  this  might  be  to 
mix  a  top-down  code  reading  technique  with  the  bottom-up  methodology  of  code 
reading  by  stepwise  abstraction.  Further  investigation  of  this  seems  worthwhile. 

The  static  data  flow  analysis  technique  used  in  this  study  is  limited  in  the  type 
of faults it can potentially detect. However, several of the faults found by this
technique were found by no other technique, and so applying it in software development
may be worthwhile, particularly given its relatively low cost of application. There
may also be language or environmental factors that reduced the number of undefined
reference faults in this particular software. For example, the requirement of
declaring all variables in Pascal may serve as a reminder to initialize variables before use.
Other  static  analysis  techniques,  such  as  associating  physical  units  with  variable 
values and analyzing the software to see if the units are appropriately preserved [21],
deserve further exploration. These types of static analysis techniques would permit
examination of the legality of usage rather than just the presence of initialization
and reference.